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THE SUBSET ARGUMENT AND CONSISTENCY OF MLE IN 
GLMM: ANSWER TO AN OPEN PROBLEM AND BEYOND 

By Jiming Jiang 1 

University of California, Davis 

We give answer to an open problem regarding consistency of the 
maximum likelihood estimators (MLEs) in generalized linear mixed 
models (GLMMs) involving crossed random effects. The solution to 
the open problem introduces an interesting, nonstandard approach to 
proving consistency of the MLEs in cases of dependent observations. 
Using the new technique, we extend the results to MLEs under a gen- 
eral GLMM. An example is used to further illustrate the technique. 

1. Introduction. Generalized linear mixed models (GLMMs) have be- 
come a popular and very useful class of statistical models. See, for example, 
Jiang (2007), McCulloch, Searle and Neuhaus (2008) for some wide-ranging 
accounts of GLMMs with theory and applications. In the earlier years after 
GLMM was introduced, one of the biggest challenges in inference about these 
models was computation of the maximum likelihood estimators (MLEs). As 
is well known, the likelihood function under a GLMM typically involves in- 
tegrals that cannot be computed analytically. The computational difficulty 
was highlighted by the infamous salamander mating data, first introduced 
by McCullagh and Nelder [(1989), Section 14.5]. A mixed logistic model, 
which is a special case of GLMM, was proposed for the salamander data 
that involved crossed random effects for the female and male animals. How- 
ever, due to the fact that the random effects are crossed, the likelihood 
function involves a high-dimensional integral that not only does not have an 
analytic expression, but is also difficult to evaluate numerically [e.g., Jiang 
(2007), Section 4.4.3]. For years, the salamander data has been a driving 
force for the computational developments in GLMM. Virtually every nu- 
merical procedure that was proposed used this data as a "gold standard" 
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to evaluate, or demonstrate, the procedure. See, for example, Karim and 
Zeger (1992), Breslow and Clayton (1993), Drum and McCullagh (1993), 
McCulloch (1994), Breslow and Lin (1995), Lin and Breslow (1996), Jiang 
(1998), Booth and Hobert (1999), Jiang and Zhang (2001), Sutradhar and 
Rao (2003), and Torabi (2012). 

1.1. A theoretical challenge and an open problem. To illustrate the nu- 
merical difficulty as well as a theoretical challenge, which is the main objec- 
tive of the current paper, let us begin with an example. 

Example 1. A mixed logistic model was proposed by Breslow and Clay- 
ton (1993) for the salamander data, and has since been used [e.g., Bres- 
low and Lin (1995), Lin and Breslow (1996), Jiang (1998)]. Some alter- 
native models, but only in terms of reparametrizations, have been consid- 
ered [e.g., Booth and Hobert (1999)]. Jiang and Zhang (2001) noted that 
some of these models have ignored the fact that a group of salamanders 
were used in both the summer experiment and one of the fall experiments; 
in other words, there were replicates for some of the pairs of female and 
male animals. Nevertheless, all of these models are special cases of the 
following, more general setting. Suppose that, given the random effects 
Ui,Vj, G S, where S is a subset of I = ■ 1 <i <m,l < j < n}, 

binary responses yijk, € S, k = 1, . . . , Cy are conditionally independent 
such that, with p ijk = P(yijk = l\u, v), we have logit(p ijfc ) = x' ijk ft + Ui + Vj, 
where logit(p) = log{p/ (1 — p)},p € (0, 1), Xijfc IS £111 known vector of covari- 
ates, ft is a unknown vector of parameters, and u, v denote all the random 
effects Ui and Vj that are involved. Here Cj,- is the number of replicates for the 
cell. Without loss of generality, assume that S is a irreducible subset 
of I in that m, n are the smallest positive integers such that S C X. Further- 
more, suppose that the random effects tt,'s and Vj's are independent with 
Ui ~ N(0,a 2 ) and Vj ~ iV(0,T 2 ), where <t 2 ,t 2 are unknown variances. One 
may think of the random effects Ui and Vj as corresponding to the female 
and male animals, as in the salamander problem. In fact, for the salamander 
data, Cjj = 2 for half of the pairs (i, j), and Cjj = 1 for the rest of the pairs. It 
can be shown [e.g., Jiang (2007), page 126; also see Section 4 in the sequel] 
that the log-likelihood function for estimating ft, a 2 ,r 2 involves an integral 
of dimension m + n, which, in particular, increases with the sample size, and 
the integral cannot be further simplified. 

The fact that the random effects are crossed, as in Example 1, presents 
not only a computational challenge but also a theoretical one, that is, to 
prove that the MLE is consistent in such a model. In contrast, the situation 
is very different if the GLMM has clustered, rather than crossed, random 
effects. For example, consider the following. 
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Example 2. Suppose that, given the random effects ui, . . . ,u m , binary 
responses yij,i = 1, . . . , m, j = 1, . . . , ni are conditionally independent such 
that, with pij = P(yij = l\u), we have logit(pjj) = + Uj, where Xij is 
a vector of known covariates, j3 a vector of unknown coefficients, and u = 
(ui)i<i<m- Furthermore, suppose that the m's are independent with m ~ 
N(0,a 2 ), where a 2 is unknown. It is easy to show that the log-likelihood 
function for estimating (3, a 2 only involves one-dimensional integrals. Not 
only that, a major theoretical advantage of this case is that the log-likelihood 
can be expressed as a sum of independent random variables. In fact, this is 
a main characteristic of GLMMs with clustered random effects. Therefore, 
limit theorems for sums of independent random variables [e.g., Jiang (2010), 
Chapter 6] can be utilized to obtain asymptotic properties of the MLE. 

Generally speaking, the classical approach to proving consistency of the 
MLE [e.g., Lehmann and Casella (1998), Chapter 6; Jiang (2010)] relies on 
asymptotic theory for sum of random variables, independent or not. How- 
ever, one cannot express the log-likelihood in Example 1 as a sum of random 
variables with manageable properties. For this reason, it is very difficult to 
tackle asymptotic behavior of the MLE in the salamander problem, or any 
GLMM with crossed random effects, assuming that the numbers of random 
effects in all of the crossed factors increase. In fact, the problem is difficult 
to solve even for the simplest case, as stated in the open problem below. 

Open problem [e.g., Jiang (2010), page 541]' Suppose that Xij k /3 — fj,, an 
unknown parameter, cy = 1 for all i,j, S = I, and o~ 2 ,t 2 are known, say, 
a 2 — t 2 — 1 in Example 1. Thus, \x is the only unknown parameter. Suppose 
that m,n^> oo. Is the MLE of fi consistent? 

It was claimed [Jiang (2010), pages 541, 550] that even for this seemingly 
trivial case, the answer was not known but expected to be anything but 
trivial. 

1.2. Origination of the open problem. The problem regarding consis- 
tency of the MLE in GLMMs with crossed random effects began to draw 
attention in early 1997. It remained unsolved over the past 15 years, and 
was twice cited as an open problem in the literature, first in Jiang [(2007), 
page 173] and later in Jiang [(2010), page 541]. The latter also provided the 
following supporting evidence for a positive answer [Jiang (2010), page 550]. 

Let k = m A n. Consider a subset of the data, ya, i = 1, . . . , k. Note that 
the subset is a sequence of i.i.d. random variables. It follows, by the standard 
arguments, that the MLE of \i based on the subset, denoted by /i, is consis- 
tent. Let fi denote the MLE of fi based on the full data, y%j,i = 1, . . . ,ni,j = 
1, . . . ,n. The point is that even the MLE based on a subset of the data, ft, 
is consistent; and if one has more data (information), one is expected to do 
better. Therefore, ft has to be consistent as well. 
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1.3. The rest of the paper. In Section 2, we give a positive answer to the 
open problem as well as the proof. Surprisingly, the proof is fairly short, 
thanks to a new, nonstandard technique that we introduce, known as the 
subset argument. Using this argument, we are able to establish both Cramer 
(1946) and Wald (1949) types of consistency results for the MLE. It is fas- 
cinating that a 15-year-old problem can be solved in such a simple way. 
The new technique may be useful well beyond solving the open problem — 
for proving consistency of the MLE in cases of dependent observations. We 
consider some applications of the subset argument in Section 3 regarding 
consistency of the MLE in a general GLMM. An example is used in Sec- 
tion 4 to further illustrate the new technique. Remark and discussion on a 
number of theoretical and practical issues are offered in Section 5. 

2. Answer to open problem. Throughout this section, we focus on the 
open problem stated in Section 1. Let fj, denote the true parameter. 

Theorem 1 (Cramer consistency). There is, with probability tending to 

p 

one, a root to the likelihood equation, £i, such that fi — > fi. 

Proof. The idea was actually hinted in Jiang [(2010), page 550] as "evi- 
dence" that supports a positive answer (see the last paragraph of Section 1.2 
of the current paper). Basically, the idea suggests that, perhaps, one could 
use the fact that the MLE based on the subset data is consistent to argue 
that the MLE based on the full data is also consistent. The question is how 
to execute the idea. Recall that, in the original proof of Wald [(1949); also 
see Wolfowitz (1949)], the focus was on the likelihood ratio Po{y) / Po (y) , 
and showing that the ratio converges to zero outside any (small) neighbor- 
hood of #0) the true parameter vector. Can we execute the subset idea in 
terms of the likelihood ratio? This leads to consideration of the relationship 
between the likelihood ratio under the full data and that under the subset 
data. It is in this context that the following subset inequality (2) is derived 
(see Section 5.1 for further discussion), which is the key to the proof. 

Let y[i] denote the (row) vector of yu, i = 1, . . . ,m Are, and yr 2 i the (row) 
vector of the rest of the yij, i = 1, . . . , m, j = 1, . . . , n. Let p^ (y[i], y\2] ) denote 
the probability mass function (p.m.f.) of (y[i],y\2]), Pfi{y\i}) the p.m.f. of j/m, 

Pn(y[i],y[2]) 



(!) Pti{y[2]\y[ii 



p^{y\i]) 



the conditional p.m.f. of yr 2 i given ynj , and P M the probability distribution, 
respectively, when fi is the true parameter. For any e > 0, we have 



{p,x (vii] , vp] ) < p»+e (y { x] , y { 2] ) b[i] } = P M { P lll^y^) - 1 



V[i] 
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< E 



Pfi+e(y[i],y[2]) 



Pi*(y[i\,y[2]) 



f[i] 



/ n ^ p^+ e (y[i]»y[2]) , , , 
( 2 = z_> — ? ^p^y[2]\y[i]) 



E 



p^+e{y[i],y[2]) 



= P/H-e (*/[!]) 

Pn{y\i}) 

using (1). A more general form of (2) is given in Section 5.1. 

On the other hand, by the standard asymptotic arguments [e.g., Jiang 
(2010), page 9], it can be shown that the likelihood ratio p^ +£ {y^) /p^iysW) 
converges to zero in probability, as mA?i->oo. Here we use the fact that 
the components of j/m, ya, 1 < i < m A n are independent Bernoulli random 
variables. It follows that, for any r/ > 0, there is N„ > 1 such that, with 
probability > 1 - r), we have Cat = P^{p^(y[i],y[2]) < P/j,+e{y[i],y[2])\y[i]} < 
7 mAn for some < 7 < 1, if m A n > N„. The argument shows that (jv = 
Op(7 mAn ), hence converges to in probability. It follows, by the dominated 
convergence theorem, that E^Ctv) = P M {p A1 (y[i], 2/[2]) < Pti+e(y[i], V[2])} -> 0. 
Similarly, we have P M {p /i (y[i], y[ 2 ]) < p M - e (y[i], y[2})} ^0. The rest of the 
proof follows by the standard arguments [e.g., Jiang (2010), pages 9-10]. □ 



The result of Theorem 1 is usually referred to as Cramer-type consis- 
tency [Cramer (1946)], which states that a root to the likelihood equation 
is consistent. However, it does not always imply that the MLE, which by 
definition is the (global) maximizer of the likelihood function, is consistent. 
A stronger result is called Wald-type consistency [Wald (1949); also see 
Wolfowitz (1949)], which states that the MLE is consistent. Note that the 
limiting process in Theorem 1 is m,n — > 00, or, equivalently, mAn—^oo (see 
Section 5.4 for discussion). With a slightly more restrictive limiting process, 
the Wald-consistency can actually be established, as follows. 

Theorem 2 (Wald consistency). //(mAn) _1 log(mVn) — > as m,n— > 
00, then the MLE of [i is consistent. 

Proof. Define po(X) = E{h(X + £)}, where h(x) = e x /(l + e x ) and £ ~ 
N(0, 2). Write po =po(fi). For any integer k, divide the interval [k,k + 1) 
by Afcj = k + 5{mn)~ l {m A n)j, j = 1, . . . , J, where J = [mn/5(m A n)] and 
< 5 < 1 — po. It is easy to show that \(d/dfj,)logpp(yM,y\ 2 ])\ < run uni- 
formly for all [i. Thus, for any AG [k, k + 1), there is 1 < j < J, such 
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that logpA(j/[i],2/[2]) - ^gpx kij (y[i],y[2]) = {(5/5//)logp At (y[i ] ,i/[ 2 ])| At= ^}(A 
Afcj) < S(m A n), where A lies between A and Xk,j- It follows that 

PA (*/[!], 2/[ 2] ) S(mAn) ^ PA fc ,>[l] , ^[2] ) 
AG[fc,fc+l)PM^[l]'f[2]) 1 ^'^ J *W[l].2/[2]) 



Therefore, by the subset argument [see (2)], we have 



P J sup g^MiM? > i 

LAe[fe,fc+l)^M^[l]'2/[2]J 



(3) <E P m 



Pn(V[i],V[2\) 



V[i] 



< e <5(m.An) P*k,j(.y[i\) 

On the other hand, we have < 1 — po(A) = E{1 + exp(A + < 
e" A E(e-«) = e 1 ~ x ; and, similarly, < p (A) < e 1+x . Let A s = < 5} 
with A = (m An)" 1 X^I^i™ Vu ~ Po- If A; > 1, then, for any 1 < j < J, write 
Pi =Po(Afcj). We have, on ^j, 

pxjJm) f/piy o+A /i-piy- p °- A i mAn 

Pm(#[1]) l\P0/ \l-Po/ 

<{a7 1 (l-pi) 1 - p °-*} mAn 
<[a7 1 e X p{(l-A fcj )(l-p -<5)}] ? 
< exp[{l -p - <5 - loga 5 - (1 -p - W(mAn)], 
where aj = inf| a .i< i ^ 0+x (l - po) 1 ^ 0- * > 0. It follows, by (3), that 



imAn 



Pi sup P -* M !>1 



Ue[fc,fc+i)P/i(2/[i]>3/[2]) 

77177- 

- 77 — 7 — r ex P[{! - Po - loga 5 - (l-po-S)k}(mAn)] 
d{m A raj 



on As, or, equivalently, that 

Jl) 

V[i] 



pa (yp] ,y[2]) , AI . 

Ue[fc,fc+1) PmU/[i]>2/[2]J 



(4) 

TfiTl 

- 77 77^7 ex P[{! - Po - loga a - (1 - po - 5) A:} (m A ra)] 1^ . 

o(m A n) 
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Note that As £ J~{y\\])- By taking expectations on both sides of (4), it follows 
that the unconditional probability corresponding to the left side is bounded 
by the right side without 1^, for k = 1,2, Therefore, we have 

„ f P\(V[1],V[2]) 1 , , N „ UI/ , 

P M < sup — — u — ^ > 1 for some k > K, \A\ < 5 

Ue[fc,fc+i) PAt(Z/[i]>y[2]) 

^ „ f Pa(2/[i],2/[21 



fc=Ff 



< > ,P M <| sup >1,|A[<(5 



oo 

^ x ex P{(! -Po " loga 5 )(m A n)} V e -(i-Po-*)(mAn)fc 

dim f\n) 

k=K 

mn e -(l-po-5)(rnAn)K 

(5) = T7 \ ex Pl(l ~ Po — log a$)(m A n)\ T . — 

W (S(mAfl) 1 - e -(l-P0-<5)(mAn) 

= {1 — g-Ci-Po-^CwiAn)!-! 

x exp[— (m A n){(l — Po — <f)-K" — 1 + Po + loga<5 

— (m A n)~ l log(m V n) + (m A n) _1 log<5}]. 

Thus, if we choose X such that (1 — po — 5)K — 1 +po + loga,5 > 1, then, for 
large m An, the probability on the left side of (5) is bounded by 2e~( mAn ^ 2 . 
On the other hand, we have P^(^) — > 0, as m A n — > oo. Thus, we have 

p\(y [i],y[2] 



(6) 



> 1 for some A > K 



Pn(V[l],V[2]) 

< 2e -(^An)/2 +p ^^ 



as m An — > oo. Similarly, the left side of (6), with the words "A > K" replaced 
by "A < —K" goes to zero, as m A n — >• oo, if K is chosen sufficiently large. 

On the other hand, again by the subset argument, it can be shown (see the 
supplementary material [Jiang (2013)]) that for any e > and K > \fi\ + e, 
we have 

(7) Pj sup "fffiMkA^o 

l\e[-K,»-eM»+e,K]Pii{y[i\,y[2]) J 

as m,n — > oo. The consistency of the MLE then follows by combining (7) 
with the previously proved results. □ 

3. Beyond. We consider a few more applications of the subset argument, 
introduced in the previous section. All applications are regarding a general 
GLMM, whose definition is given below for the sake of completeness [see, 
e.g., Jiang (2007) for further details]. 
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(i) Suppose that, given a vector u of random effects, responses yi, • • . ,yjv 
are conditionally independent with conditional density function, with re- 
spect to a cr-finite measure v, given by the exponential family fi(y%\u) = 
exp{a~ 1 ((f)){yi£i — b(!;i)} + Ci(yi, </>)], where <fi is a dispersion parameter (which 
in some cases is known), and &(•), aj(-), Cj(-, •) are known, continuously dif- 
ferentiable functions with respect to £j and </>. The natural parameter of 
the conditional exponential family, £j, is therefore associated with the con- 
ditional mean, /ij = E(yj|n), according to the properties of the exponential 
family [e.g., McCullagh and Nelder (1989), Section 2.2.2]. (ii) Furthermore, 
suppose that /ij satisfies g(fJ.i) = x[f3 + z^u, where known vectors, (5 

is a vector of unknown parameters, and g(-) is a link function, (iii) Finally, 
assume that u ~ N(0, G), where the covariance matrix G may depend on a 
vector ip of dispersion parameters. 

It is typically possible to find a subset of the data that are independent, in 
some way, under a general GLMM. For example, under the so-called ANOVA 
GLMM [e.g., Lin (1997)], a subset of independent data can always be found. 

Here an ANOVA GLMM satisfies g(ji) = X/3 + Zmi H h Z s u s , where (i = 

((H)i<i<N, g{n) = [g(Vi)]i<i<N, X = (x<)i<i<jv, Z r = (z' ir )i<i< N , 1 < r < s, 
are known matrices, u r , 1 < r < s are vectors of independent random effects, 
and u\,...,u s are independent. Examples 1 and 2 are special cases of the 
ANOVA GLMM. Note that in both examples the responses are indexed by 
instead of i, but this difference is trivial. Nevertheless, the "trick" is 
to select a subset, or more than one subsets if necessary, with the following 
desirable properties: (I) the subset (s) can be divided into independent clus- 
ters with the number(s) of clusters increasing with the sample size; and (II) 
the combination of the subset(s) jointly identify all the unknown parame- 
ters. More specifically, let 2/ 4 - a \« = 1, . . . , N a be the ath subset of the data, 
1 < a < b, where b is a fixed positive integer. Suppose that, for each a, there 
is a partition, {1, . . . , N a } = |J^i S a j . Let y a j = [y^]ies aJ , and pe{y a ,j) be 
the probability density function (p.d.f.) of y a j, with respect to the measure 
v (or the product measure induced by v if y a j is multivariate), when 9 is 
the true parameter vector. Let O denote the parameter space, and #o the 
true parameter vector. Then, (I) and (II) can be formally stated as follows: 

(Al) y a j, 1 < j < m a are independent with m a oo as iV - > oo,l <a<b; 
(A2) for every 6 G \ {#o}> we have 




Note that (A2) controls the average Kullback-Leibler information [Kull- 
back and Leibler (1951)]; thus, the inequality always holds if < is replaced 
by <• 
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3.1. Finite parameter space. Let us first consider a simpler case by as- 
suming that is finite. Although the assumption may seem restrictive, it 
is not totally unrealistic. For example, any computer system only allows a 
finite number of digits. This means that the parameter space that is practi- 
cally stored in a computer system is finite. Using the subset argument, it is 
fairly straightforward to prove the following (see the supplementary material 
[Jiang (2013)]). 

Theorem 3. Under assumptions (Al) and (A2), if, in addition, 
(A3) for every 9 € \ {9q}, we have 

Pe{Va,j) 



mi 
a i=i 



var 0o 



log 



Pe (Va,j) 



0. 



1< a < b, 



then Pe (9 = O ) — > 1, as N — > oo, where 9 is the MLE of 9. 



3.2. Euclidean parameter space. We now consider the case that is a 
convex subspace of R d , the d-dimensional Euclidean space, in the sense that 
9\ , 92 £ implies (l — t)0i + t0<z E for every t € (0,1). In this case, we need 
to strengthen assumptions (A2), (A3) to the following: 

(B2) 9q € 0°, the interior of 0, and there is < M < oo [same as in (B3) 
below] such that, for every e > 0, we have 



j m a 

(8) lim sup sup min — N, Eflo 

N^oo e&@,e<\B-e Q \<M l < a < bm a J~ ^ 



log 



Pe{y a ,j) 



<0. 



Pe (y a ,j) 

(B3) There are positive constant sequences sjv, s a ,N, I < a <b such that 



(9) 



sup max 

e&e,\e-e \<M 1 < c < d 



d 

— \og{p e {y)} 



Op (sat) 



with log(sj V )/min 1 < a < 6 m a ->• 0, where p (y) is the p.d.f. of y = (yi)i<i<N 
given that 9 = (# c )i< c <d is the true parameter vector, 



(10) sup — y max 

6»eG,|6»-6» |<A/ m a ^ l<c<d 



_d_ 



\og{pe{y a ,j)} 



Op(s a ,v) 



with log(s aj Ar)/m a — > 0; and (for the same s a) jv) 



(11) 



„<2-l m a 
o,JV 



sup , ^xE varfl o 



log 



Pe(y a ,j) 
Pe {y a ,j) 



0. 



1< a < 6. 
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Theorem 4. Under assumptions (Al), (B2) and (B3), there is, with 

„ P 

probability — > 1, a root to the likelihood equation, 9, such that 9 — > 9q, as 
N ^oo. 

Proof. Aside from the use of the subset argument, the lines of the proof 
are similar to, for example, the standard arguments of Lehmann and Casella 
[(1998), the beginning part of the proof of Theorem 5.1], although some 
details are more similar to Wolfowitz (1949). We outline the key steps below 
and refer the details to the supplementary material [Jiang (2013)]. Once 
again, the innovative part is the consideration of the conditional probability 
given the subset data and, most importantly, the subset inequality (15) in 
the sequel. 

For any e > 0, assume, without loss of generality, that {9 : \0 — 0q\ < e} C 
and C £ = {6 e R d : \9 C - 9 0c \ < e, 1 < c < d} C {6 e 6 : \6 - 6 \ < M}. Essen- 
tially, all we need to show is that, as N — > oo, 

(12) P(e)=P e Jp eo (y) < sup p e (y)\ — ► 0, 

L eedc E J 

where dC e is the boundary of C e , which consists of 9 € C e such that \9 C — 
9q c \ = e for some 1 < c < d. Define 

Sn,o,(9) = — / E 9o 
m a f-f 

and In (9) = min{l < a < b:SN, a (6) = nihii< a '<;, Sn,o,'(6)}- Then, dC £ = 
ULi dC e n Ar,a, where 8^ = ^66: I N {9) = a}. Then, we have 

b 

(13) P(e)<J2Ve {pe (y)< sup Pe (y)\. 

For a fixed 1 < a < b, let 5 be a small, positive number to be determined 
latter, and K = [e 8rria ] + 1. For any I = (h, . . . , l d ), where < l c < K - 1, 1 < 
c<d, select a point 9\ from the subset {9 : 9q c — £ + 2el c /K <9 C < 9q c — e + 
2e(l c + l)/K, 1 <c<d}n dC e n Gat^, if the latter is not empty; otherwise, do 
not select. Let D denote the collection of all such points. Also let B denote 
the left side of (9). It can be shown that 

Pe \p9 (y)< sup P e{y)\ 
L 9&dc E ne Na J 

(14) 

< p e {exp(^|^ > 2 1 +P 8o {pe (y) < 2maxp e (y)}. 

We now apply the subset argument. Let ym denote the combined vector 
of y a j, 1 < j < m a , and ypi the vector of the rest of y\, . . . ,yjv- Then, similar 



log 



Pe{y a ,j) 

P0n(ya,i) 



Ka<b, 
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to the argument of (2), we have, for any 9 € D, 

(15) P«,{Wb(v) < %»(y)\v\i]} < 2^7^- 

P0 O yy[i}) 

Using this result, it can be shown that Pg {po (y) < 2maxg € £) Pe(y)\y[i]} = 
op(l). Prom here, (12) can be established. □ 

Again, Theorem 4 is a Cramer-consistency result. On the other hand, 
Wald-consistency can be established under additional assumptions that con- 
trol the behavior of the likelihood function in a neighborhood of infinity. 
For example, the following result may be viewed as an extension of The- 
orem 2. The proof is given in the supplementary material [Jiang (2013)]. 
Once again, the subset argument plays a critical role in the proof. For 
simplicity, we focus on the case of discrete responses, which is typical for 
GLMMs. In addition, we assume the following. For any < v < w, define 
Sd[v, w) = {x G R d : v < \x\ < w} and write, in short, Sd(k) = Sd[k, k + 1) for 
k = l,2,.... 

(CI) There are sequences of constants, bk,CN > 1, and random variables, 
C/V) where cn,(n do not depend on k, such that (n = Op(l) and 



sup max 

6>eens d [fc-i,fc+2) l < c < d 



d 

— log{ Pe {y)} 



<hcN(N, A; = 1,2,... 



(C2) There is a subset of independent data vectors, ytj\, 1 < j < [not 
necessarily among those in (Al)] so that: (i) Eg \ log{pj,6> (2/Q-))}| is bounded, 
Pj${-) being the p.m.f. of under 6; (ii) there is a sequence of positive 
constants, 7^, with limfc_ >00 7fc = 00, and a subset Tn of possible values of 
such that for every k > 1 and 9 G O D Sd(k), there is t £ Tn satisfying 
maxi^m^log-tpj^)} < -7 fe ; (iii) inf terjv m^ 1 Y^iPjA^) - 9 for some 
constant p > 0; and (iv) \T N \/m N = o(l), and c d N YX=K k dl bf.e- SniN ^ k = o(l) 
for some K > 1 and 5 < p, where d\ = dl^d>i) ■ 

It is easy to verify that the new assumptions (CI), (C2) are satisfied in 
the case of Theorem 2 for the open problem (see the supplementary material 
[Jiang (2013)]). Another example is considered in the next section. 

Theorem 5. Suppose that (Al) holds; (B2), (B3) hold for any fixed 
M > (instead of some M > 0), and with the ^ in (11) replaced by s d jy. 
In addition, suppose that (CI), (C2) hold. Then, the MLE of9o is consistent. 

4. Example. Let us consider a special case of Example 1 with x'^fi = /i, 
but <t 2 and r 2 unknown. We change the notation slightly, namely, yij,k 
instead of yijk- Suppose that S = S\ US2 such that = r, (i, j) £ S r , r = 1, 2 
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(as in the case of the salamander data). We use two subsets to jointly identify 
all the unknown parameters. The first subset is similar to that used in the 
proofs of Theorems 1 and 2, namely, y^ = {yi^k)k=l,2, (m) G *f>2- Let m\ be 
the total number of such (i,i)'s, and assume that mi — > 00, as m,n — > 00. 
Then, the subset satisfies (Al). Let 9 = (/i,c 2 ,r 2 )' . It can be shown that 
the sequence yi,i,(i,i) £ S2 is a sequence of i.i.d. random vectors with the 
probability distribution, under 6, given by 

exp{yi,i. (// + £)} 



(16) 



Pe{yi 



E 



{l + exp(/i + £)} 2 

where £ ~ A(0, ?/> 2 ), with V> 2 = o" 2 + r 2 , and y, 
concavity of the logarithm, we have 

Pe(Vi,i 



yi,i,i + yLL2- By the strict 



(17) 



E e 



log 



<0 



unless Pe(yi,i) / Pe Q (yi,i) is a.s. Pg a constant, which must be one because 
both pq and pg are probability distributions. It is easy to show that the 
probability distribution of (16) is completely determined by the function 
Af(0) = [M r (0)] r =i,2, where M r ($) = E{h r #(()} with = MO = 

exp(/i + ipQ/{l + exp(/z + ^C)}j an d C ~ A(0, 1). In other words, pe(yi,i) = 
Pe (yi,i) f° r an values of y^i if and only if M(§) = M(i?o)- Jiang (1998) 
showed that the function M(-) is injective [also see Jiang (2007), page 221]. 
Thus, (17) holds unless \i = fio and i/j 2 = tpQ. 

It remains to deal with a 9 that satisfies fi = hq , ip 2 = t/)Q, but 9 ^9q. For 
such a 6>, we use the second subset, defined as yj = (2/1,21-1,1,2/1,21,1)' such 
that (i, 2i — 1) S S* and (i, 2i) S 5. Let m<i be the total number of all such 
Vs, and assume that mi — > 00 as m,n — > 00. It is easy to see that (Al) is, 
again, satisfied for the new subset. Note that any 9 satisfying /x = fj,Q and 
tp 2 = ipQ is completely determined by the parameter 7 = a 2 ftp 2 . Furthermore, 
the new subset is a sequence of i.i.d. random vectors with the probability 
distribution, under such a 9, given by 

'exp{j/ i2 i-i,i(/io + A)} exp{y» 2i,i(/io + 



(18) P ,(y i )-E [ 1 + exp(/UQ + x) i + exp^o + F) 

where (X,Y) has the bivariate normal distribution with var(X) 
ipQ and cor(X, Y) = 7. Similar to (17), we have 



var(y) 



(19) 



E 



To 



log< 



<0 



IPloiVi) 

unless p^(yi) =P- ro {yi) for all values of j/j. Consider (18) with = (1, 1) and 
let P 7 denote the probability distribution of (X,Y) with the correlation 
coefficient 7. By Fubini's theorem, it can be shown that 



(20) p 7 (l,l) 



P 7 {X > logit(s) -fi Q ,Y> logit(t) - fi Q }dsdt. 
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Hereafter, we refer the detailed derivations to the supplementary material 
[Jiang (2013)]. By Slepian's inequality [e.g., Jiang (2010), pages 157-158], 
the integrand on the right side of (20) is strictly increasing with 7, hence so 
is the integral. Thus, if 7 7^70, at least we have p 7 (l, 1) ^ p 10 {l,l), hence 
(19) holds. 

In summary, for any 9 £ 0, 9 7^ 9q, we must have either (17) or (19) hold. 
Therefore, by continuity, assumption (B2) holds, provided that true vari- 
ances, <7q,Tq are positive. Note that, in the current case, the expectations 
involved in (B2) do not depend on either j or N, the total sample size. 

To verify (B3), it can be shown that \(d/dfj,)log{pg(y)}\ < N. Further- 
more, we have \ (d/da 2 ) log{p g (y)}\ V \{d/dr 2 ) log{p e {y)}\ < {A + C + 1)N in 
a neighborhood of 9q, M(9q). Therefore, (9) holds with sn = N. 

As for (10), it is easy to show that the partial derivatives involved are 
uniformly bounded for 9 gM(9q). Thus, (10) holds for any s at N such that 
Sa,N — > 00, a = 1, 2. Furthermore, the left side of (11) is bounded by c a s 2 N /m ( 
for some constant c a > 0, a = 1,2 (note that d = 3 in this case). Thus, for 
example, we may choose s a ,N = v {1 + log(m a )}, a = 1,2, to ensure that 
log(s aj 7v)An a — >• 0, a = 1,2, and (11) holds. 

In conclusion, all the assumptions of Theorem 4 hold provided that a 2 > 0, 
Tq > 0, and (mi A m 2 ) _1 log(iV) ->• 0. 

Similarly, the conditions of Theorem 5 can be verified. Essentially, what is 
new is to check assumptions (CI) and (C2). See the supplementary material 
[Jiang (2013)]. 



5. Discussion. 



5.1. Remark on subset argument. In proving a number of results, we 
have demonstrated the usefulness of the subset argument. In principle, the 
method allows one to argue consistency of the MLE in any situation of 
dependent data, not necessarily under a GLMM, provided that one can 
identify some suitable subset(s) of the data whose asymptotic properties 
are easier to handle, such as collections of independent random vectors. 
The connection between the full data and subset data is made by the subset 
inequality, which, in a more general form, is a consequence of the martingale 
property of the likelihood-ratio [e.g., Jiang (2010), pages 244-246]: suppose 
that Y\ is a subvector of a random vector Y. Let pe(-) and pi t g(-) denote the 
p.d.f.'s of Y and Y\, respectively, with respect to a cr-finite measure v, under 
the parameter vector 9. For simplicity, suppose that pe ,Pi,e are positive 
a.e. v, and A(-) is a positive, measurable function. Then, for any 9, we have 

Pe {P8 (Y) < X(Y 1 )p (Y)\Y 1 } < m)^^- a.e. v, 

where Pg denotes the probability distribution corresponding to pg . 
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5.2. Quantifying the information loss. On the other hand, the subset 
argument merely provides a method of proof for the consistency of the full- 
data MLE — it by no means suggests the subset-data MLE as a replacement 
for the full-data MLE. In fact, there is an information loss if such a replace- 
ment takes place. To quantify the information loss, assume the regularity 
conditions for exchanging the order of differentiation and integration. Then, 
the Fisher information matrix based on the full data can be expressed as 



I f (0) 



2 



3989 



7 logp e (y) 



Eg 







log p e (y) 



de 

ii,i(9) - ii, 2 (e). 



0_ 

06 



log pe(y) 



Eq 



l 



2 



Pe(y) 09 d9 



-,pe(y) 



Similarly, the information matrix based on the subset data can be expressed 
as I B {9) =I s ,i(6>) - I Sj2 (9), where I s j(9) is h,j{9) with y replaced by y^, 
j = 1,2 [pe{y\i]) denotes the p.d.f. (or p.m.f.) of y\i]\- By conditioning on 
2/[ij, it can be shown that I{^(9) = / Sj 2(#), while h^\{9) > / Sj i(#). It follows 
that 



(21) 



It(9)>I s (9) 



for all 9. Here the inequality means that the difference between the left 
side and right side is a nonnegative definite matrix. (21) suggests that the 
information contained in the full data is no less than that contained in the 
subset data, which, of course, is what one would expect. Furthermore, the 
information loss is given by 



(22) 



I { (9)-I s (9) = E e 



T^logpe(y) 



V[i] 



where Var^-jy^]) denotes the conditional covariance matrix given ym un- 
der 9. The derivations of (21) and (22) are deferred to the supplementary 
material [Jiang (2013)]. It is seen from (22) that the information loss is 
determined by how much (additional) variation there is in the score func- 
tion, (d / 89) log po(y), given the subset data ym. In particular, if ym = 
y, then the score function is a constant vector given ym (and 9); hence 
Vavg{(d/d9) logpo(y)\y^} = 0, thus, there is no information loss. In gen- 
eral, of course, the subset data yni is not chosen as y; therefore, there will 
be some loss of information. 

Nevertheless, the information contained in the subset data is usually suf- 
ficient for identifying at least some of the parameters. Note that consistency 
is a relatively weak asymptotic property in the sense that various estimators, 
including those based on the subset data and, for example, the method of 
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moments estimator of Jiang (1998), are consistent, even though they may 
not be asymptotically efficient. Essentially, for the consistency property to 
hold, one needs that, in spite of the potential information loss, the remain- 
ing information that the estimator is able to utilize grows with the sample 
size. For example, in the open problem (Sections 1 and 2), the information 
contained in ya grows at the rate of m An, which is sufficient for identifying 
u; in the example of Section 4, the information contained in y^j grows in 
the order of m\ , which is sufficient for identifying a and ip 2 = a 2 + r 2 , while 
the information contained in j/j grows at the rate of 1712, which is sufficient 
for identifying 7 = a 2 /ip 2 . The identification of the "right" subset in a given 
problem is usually suggested by the nature of the parametrization. As men- 
tioned (see the third paragraph of Section 3), a subset ym of independent 
data can always be found under the ANOVA GLMM (e.g., starting with 
the first observation, yi, one finds the next observation such that it involves 
different random effects from those related to y±, and so on). If the ym is 
such that liminfAr^oo A m i n {/ s (0)} = 00, where I s {0) is as in (21) and A m i n 
denotes the smallest eigenvalue, the subset yni is sufficient for identifying all 
the components of 9; otherwise, more than one subsets are needed in order 
to identify all the parameters, as is shown in Section 4. 

5.3. Note on computation of MLE. The subset argument offers a pow- 
erful tool for establishing consistency of the MLE in GLMM with crossed 
random effects. Note that the idea has not followed the traditional path 
of attempting to develop a (computational) procedure to approximate the 
MLE. In fact, this might explain why the computational advances over the 
past two decades [see, e.g., Jiang (2007), Section 4.1 for an overview] had 
not led to a major theoretical breakthrough for the MLE in GLMM in terms 
of asymptotic properties. Note that the MLE based on the subset data is 
a consistent estimator of the true parameter, and in that sense it is an ap- 
proximation to the MLE based on the full data (two consistent estimators of 
the same parameter approximate each other). However, there is an informa- 
tion loss, as discussed in the previous subsection [see (22)], so one definitely 
wants to do better computationally. 

One computational method that has been developed for computing the 
MLE in GLMMs, including those with crossed random effects, is Monte 
Carlo EM algorithm [e.g., McCullogh (1994, 1997), Booth and Hobert (1999)]. 
Here, however, we would like to discuss another, more recent, computational 
advance, known as data cloning [DC; Lele, Dennis and Lutscher (2007), Lele, 
Nadeem and Schmuland (2010)]. The DC uses the Bayesian computational 
approach for frequentist purposes. Let tt denote the prior density function 
of 8. Then, one has the posterior, 

(23) 7 r(%) = ^#^, 
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where p(y) is the integral of the numerator with respect to 9, which does 
not depend on 9. There are computational tools using the Markov chain 
Monte Carlo for posterior simulation that generate random variables from 
the posterior without having to compute the numerator or denominator of 
(23) [e.g., Gilks, Richardson and Spiegelhalter (1996); Spiegelhalter et al. 
(2004)]. Thus, we can assume that one can generate random variables from 
the posterior. If the observations y were repeated independently from K 
different individuals such that all of these individuals result in exactly the 
same data, y, denoted by y^ = (y, . . . , y), then the posterior based on y( K ^ 
is given by 

Lele, Dennis and Lutscher (2007), Lele, Nadeem and Schmuland (2010) 
showed that, as K increases, the right side of (24) converges to a multi- 
variate normal distribution whose mean vector is equal to the MLE, 9, and 
whose covariance matrix is approximately equal to K~ 1 I^ 1 (9). Therefore, 
for large K, one can approximate the MLE by the sample mean vector of, 
say, . . . , #( B ) generated from the posterior distribution (24). Denoted 
the sample mean by 9^'\ and call it the DC MLE. Furthermore, If 1 (9) [see 
(21), (22)] can be approximated by K times the sample covariance matrix 
of 9^\ . . . , 9^ B \ Torabi (2012) successfully applied the DC method to obtain 
the MLE for the salamander-mating data. 

Note that the DC MLE is an approximate, rather than exact, MLE, in 
the sense that, as K — > oo, the difference between 9^ and the exact MLE 
vanishes. Because we have established consistency of the exact MLE, it fol- 
lows that the DC MLE is a consistent estimator as long as the number K 
increase with the sample size. More precisely, it is shown in the supplemen- 
tary material [Jiang (2013)] that, for every e, 5 > 0, there is N Et g such that for 
every n > iV £i5 and B > 1, there is K(n, B) such that Y{\9^ - 9 \ >e}<5, 
if K > K(n,B), where 9q is the true parameter vector. Note that, as far as 
consistency is concerned, one does not need that B goes to infinity. This 
makes sense because, as K — > oo, the posterior (24) is becoming degenerate 
[the asymptotic covariance matrix is K~ 1 If 1 (9)]; thus, one does not need 
a large B to "average out" the variation in Thus, from an asymptotic 
point of view, the result of the current paper provides a justification for the 
DC method. 

More importantly, because B, K are up to one's choice, one can make sure 
that they are large enough so that there is virtually no information loss, as 
was concerned earlier. In this regard, a reasonably large B would reduce the 
sampling variation and therefore improve the DC approximation, and make 
the computation more efficient. See Lele, Nadeem and Schmuland (2010) for 
discussion on how to choose B and K from practical points of view. 
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As for the prior ir, Lele, Nadeem and Schmuland (2010) only suggests 
that it be chosen according to computational convenience and be proper (to 
avoid improper posterior). Following the subset idea, an obvious choice for 
the prior would be the multivariate normal distribution with mean vector 6 S , 
the subset-data MLE, and covariance matrix I~ 1 (6 S ) [defined above (21)]. 
Note that I s (6) is much easier to evaluate than lf(0). This would make 
the procedure more similar to the empirical Bayes than the hierarchical 
one. Nevertheless, the DC only uses the Bayesian computational tool, as 
mentioned. 

5.4. Regarding the limiting process. In some applications of GLMM, the 
estimation of the random effects are of interest. There have also been de- 
velopments in semiparametric GLM and nonparametric ANOVA. In those 
cases, the random effects are treated the same way as the fixed effects. As 
a result, the proof of the consistency results in those cases usually impose 
constraints on the ratio of the number of effects and number of observations 
falling in each cluster [e.g., Chen (1995), Jiang (1999), Wu and Liang (2004), 
and Wang, Tsai and Qu (2012)]. A major difference exists, however, between 
the case of clustered data (e.g., Example 2) and that with crossed random 
effects (e.g., Example 1) in that, in the latter case, the data cannot be di- 
vided into independent groups (with the number of groups increasing with 
the sample size). Furthermore, the necessary constraints are very different 
depending on the interest of estimation. Consider, for example, a very simple 
case of linear mixed model, y^- = fi + Ui + Vj + eij , i = 1, . . . , m, j = 1, . . . , n, 
where the Uj's and i>j's are random effects, and e^s are errors. Assume, for 
simplicity, that all the random effects and errors are i.i.d. ^(0,1), so that 
fi is the only unknown parameter. Suppose that n — > oo, while m is fixed, 
say, m = 1. In this case, y±. = n _1 Y^j=\ V±i = fJ- + ux + v. + e\. is a consistent 
estimator of the cluster mean, [i\ = [i + u\ . On the other hand, the MLE 
of /U, which is also yi., is inconsistent (because it converges in probability 
to [i + ui, which is not equal to /j, with probability one). Note that here 
the ratio of the number of effects and number of observations in the cluster 
is 2/n. Apparently, this is sufficient for consistently estimating the mixed 
effect fi + ui, but not the fixed effect /i. One might suspect that the case 
m = 1 is somewhat extreme, as fi and u± are "inseparable"; but it does 
not matter. In fact, for any m > 1, as long as it is fixed, the MLE of /j, is 
y.. = (mn)" 1 YliLi Sj=i Vij = f^+u.+v.+e,,, which converges in probability 
to fi + u. as n — y oo, and fi + u. ^ [i with probability one. Thus, the only way 
that the MLE of fi can be consistent is to have both m and n go to oo. 

The example also helps to explain why it is necessary to consider the 
limiting process mAn-)-oo, instead of something else, in the open problem. 
The result of Theorem 1 shows that m A n — > oo is also sufficient for the 
consistency of the MLE. In fact, from the proof of Theorem 1 it follows that, 
for large m,n, we have with probability tending to one that the conditional 
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probability that Pn(y) < p^+ £ (y) given j/m is bounded by ^ mhn for some 
constant < 7 < 1. The corresponding upper bound under Theorem 3 is 
e -\m a £ or gome con stant A > 0, where m a is the number of independent 
vectors in the subset yni, and a similar result holds under Theorem 4 with 
the upper bound being exp[— Am a {l + o(l)}]. The assumption of Theorem 3, 
namely, (Al), makes sure that to* = mini< a <{,m a — > 00 as the sample size 
increases; the assumptions of Theorem 4, namely, (Al) and (B3), make sure 
that, in addition, the o(l) in the above vanishes as to* — > 00. 

Although estimation of the random effects is not an objective of this 
paper, in some cases this is of interest. For example, one may consider esti- 
mating the conditional mean of yij given m in the open problem (which may 
correspond to the conditional probability of successful mating with the ith 
female in the salamander problem). As mentioned, the data are not clustered 
in this case; in other words, all the data are in the same cluster, so the ratio 
of the number of effects over the number of observations is (l + m + n)/mn = 
m~ 1 + n~ 1 + (mn) -1 , which goes to zero as mAn^ 00. It is easy to show that 
y~i. = n~ l X)j=i Vij m a consistent estimator of E^yjj |itj) = E{/i(/i + m + 77) }, 
where h(x) = e x /(l + e x ) and the (unconditional) expectation is with re- 
spect to r] ~ N(0, 1), 1 <i <m. Similarly, y.j = m~ l YliLi Vij ^ s a consistent 
estimator of Ei^(yij\vj) = E{h(fi + £ + Vj)}, where the (unconditional) expec- 
tation is with respect to £ ~ N(0, 1), 1 < j < n. 
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